{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building your own Spotify Wrapped"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we will build our own version of Spotify Wrapped, covering top artists, songs, and trends over the year, from our downloadable Spotify personal history.\n",
    "\n",
    "You can request your data from Spotify [via this link](https://www.spotify.com/uk/account/privacy/). Make sure to select your extended streaming history. This process can take up to a month, so expect to wait a few weeks before your JSON files are generated and sent to you. You can then add these files to the `data` folder to run the indexing process and build your own dashboard.\n",
    "\n",
    "Alternatively, you can test the notebook with the mini sample data provided.\n",
    "\n",
    "![](/img/spotify.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exploring Spotify Streaming Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the data has been exported, we can take a look at it. Spotify provides a helpful description of each field to make the format easier to understand:\n",
    "\n",
    "![](img/spotify%20schema.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's do a quick test to view our data, selecting only certain columns to keep some personal data private:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import json\n",
    "\n",
    "cols = [\n",
    "    \"ts\",\n",
    "    \"ms_played\",\n",
    "    \"master_metadata_track_name\",\n",
    "    \"master_metadata_album_artist_name\",\n",
    "    \"master_metadata_album_album_name\",\n",
    "]\n",
    "file_name = \"data/sample_data.json\"\n",
    "\n",
    "with open(file_name, \"r\") as file:\n",
    "    data = json.load(file)\n",
    "\n",
    "df = pd.DataFrame(data=data, columns=cols)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ts</th>\n",
       "      <th>ms_played</th>\n",
       "      <th>master_metadata_track_name</th>\n",
       "      <th>master_metadata_album_artist_name</th>\n",
       "      <th>master_metadata_album_album_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2023-07-04T12:42:02Z</td>\n",
       "      <td>247000</td>\n",
       "      <td>Little Lion Man</td>\n",
       "      <td>Mumford &amp; Sons</td>\n",
       "      <td>Sigh No More (Benelux Edition)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2023-07-04T12:44:09Z</td>\n",
       "      <td>40375</td>\n",
       "      <td>Girlfriend</td>\n",
       "      <td>Avril Lavigne</td>\n",
       "      <td>The Best Damn Thing (Expanded Edition)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2023-07-04T12:47:22Z</td>\n",
       "      <td>193226</td>\n",
       "      <td>Better Love - From \"The Legend Of Tarzan\" Orig...</td>\n",
       "      <td>Hozier</td>\n",
       "      <td>Better Love</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2023-07-04T12:50:49Z</td>\n",
       "      <td>206546</td>\n",
       "      <td>Talk</td>\n",
       "      <td>Hozier</td>\n",
       "      <td>Wasteland, Baby!</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2023-07-04T12:59:55Z</td>\n",
       "      <td>331274</td>\n",
       "      <td>No Plan</td>\n",
       "      <td>Hozier</td>\n",
       "      <td>Wasteland, Baby!</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     ts  ms_played  \\\n",
       "0  2023-07-04T12:42:02Z     247000   \n",
       "1  2023-07-04T12:44:09Z      40375   \n",
       "2  2023-07-04T12:47:22Z     193226   \n",
       "3  2023-07-04T12:50:49Z     206546   \n",
       "4  2023-07-04T12:59:55Z     331274   \n",
       "\n",
       "                          master_metadata_track_name  \\\n",
       "0                                    Little Lion Man   \n",
       "1                                         Girlfriend   \n",
       "2  Better Love - From \"The Legend Of Tarzan\" Orig...   \n",
       "3                                               Talk   \n",
       "4                                            No Plan   \n",
       "\n",
       "  master_metadata_album_artist_name        master_metadata_album_album_name  \n",
       "0                    Mumford & Sons          Sigh No More (Benelux Edition)  \n",
       "1                     Avril Lavigne  The Best Damn Thing (Expanded Edition)  \n",
       "2                            Hozier                             Better Love  \n",
       "3                            Hozier                        Wasteland, Baby!  \n",
       "4                            Hozier                        Wasteland, Baby!  "
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Connecting to your Elastic cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "from elasticsearch import Elasticsearch, helpers\n",
    "from getpass import getpass\n",
    "\n",
    "# Prompt for the Elastic Cloud credentials without echoing them\n",
    "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
    "ELASTIC_API_KEY = getpass(\"Elastic API Key: \")\n",
    "\n",
    "# Create an Elasticsearch client using the provided credentials\n",
    "client = Elasticsearch(\n",
    "    cloud_id=ELASTIC_CLOUD_ID,  # Cloud ID can be found under deployment management\n",
    "    api_key=ELASTIC_API_KEY,  # API key for connecting to Elastic, found under Deployments - Security\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding the documents into an Elasticsearch index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once your data is available, you can add your documents to a local folder. In my example, I put the JSON files covering my 5 years of streaming history into the `data` folder.\n",
    "For the purpose of this demo notebook, I have also added a simplified sample of my streaming data, with some fields hidden for data privacy, that can be used to run the following cells.\n",
    "\n",
    "![](/img/data.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'spotify-history-eli'})"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "index_name = \"spotify-history\"\n",
    "\n",
    "# Create the Elasticsearch index with the specified name (delete if already existing)\n",
    "if client.indices.exists(index=index_name):\n",
    "    client.indices.delete(index=index_name)\n",
    "client.indices.create(index=index_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now open these files with a JSON reader and generate documents for our Elasticsearch index directly from them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_docs(dataset_path):\n",
    "    \"\"\"Bulk-index every record of one Spotify JSON export file.\"\"\"\n",
    "    with open(dataset_path, \"r\", encoding=\"utf-8\") as f:\n",
    "        documents = json.load(f)  # each file holds a list of streaming records\n",
    "    # helpers.bulk sends the documents to the index in batches\n",
    "    helpers.bulk(client, documents, index=index_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import required module\n",
    "import os\n",
    "\n",
    "# assign directory\n",
    "directory = \"data\"\n",
    "file_list = []\n",
    "for filename in os.listdir(directory):\n",
    "    f = os.path.join(directory, filename)\n",
    "    if os.path.isfile(f) and f.endswith(\".json\"):\n",
    "        file_list.append(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "for DATASET_PATH in file_list:\n",
    "    generate_docs(DATASET_PATH)"
   ]
  },
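  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loading step above passes raw records to `helpers.bulk` and relies on its `index` argument. As a hedged alternative, the sketch below wraps each record in an explicit bulk action instead. `iter_actions` is a hypothetical helper name that is not part of the original notebook, and the file is still read in one go by `json.load`.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "\n",
    "def iter_actions(dataset_path, index_name):\n",
    "    \"\"\"Yield one bulk action per streaming record in a JSON export file.\"\"\"\n",
    "    with open(dataset_path, \"r\", encoding=\"utf-8\") as f:\n",
    "        records = json.load(f)  # the export is a JSON list of records\n",
    "    for record in records:\n",
    "        # _index and _source are the standard keys of a bulk action\n",
    "        yield {\"_index\": index_name, \"_source\": record}\n",
    "```\n",
    "\n",
    "With the `client` created earlier, this could be used as `helpers.bulk(client, iter_actions(\"data/sample_data.json\", \"spotify-history\"))`."
   ]
  },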
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the data is added to Elasticsearch, a mapping is generated automatically. The Spotify data is already well structured, so the default mapping is accurate enough that we don't need to define it manually. The main detail to pay attention to is that text fields such as the artist name are also indexed as a `keyword` sub-field, which lets us run terms aggregations on exact values in the following steps.\n",
    "\n",
    "Here's what it will look like on the Elastic side:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mappings = {\n",
    "    \"properties\": {\n",
    "        \"conn_country\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"episode_name\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"episode_show_name\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"incognito_mode\": {\"type\": \"boolean\"},\n",
    "        \"ip_addr\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"master_metadata_album_album_name\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"master_metadata_album_artist_name\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"master_metadata_track_name\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"ms_played\": {\"type\": \"long\"},\n",
    "        \"offline\": {\"type\": \"boolean\"},\n",
    "        \"offline_timestamp\": {\"type\": \"long\"},\n",
    "        \"platform\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"reason_end\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"reason_start\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"shuffle\": {\"type\": \"boolean\"},\n",
    "        \"skipped\": {\"type\": \"boolean\"},\n",
    "        \"spotify_episode_uri\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"spotify_track_uri\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"ts\": {\"type\": \"date\"},\n",
    "    }\n",
    "}"
   ]
  },
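  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see at a glance which fields can be used in terms aggregations, the mapping can be scanned for `keyword` sub-fields. This is only a minimal sketch: `keyword_fields` is a hypothetical helper name, and `excerpt` is a shortened copy of the mapping shown above.\n",
    "\n",
    "```python\n",
    "def keyword_fields(mappings):\n",
    "    \"\"\"Return the `<field>.keyword` paths found in a mapping dict.\"\"\"\n",
    "    paths = []\n",
    "    for name, spec in mappings.get(\"properties\", {}).items():\n",
    "        # text fields mapped by default carry a `keyword` sub-field\n",
    "        if \"keyword\" in spec.get(\"fields\", {}):\n",
    "            paths.append(f\"{name}.keyword\")\n",
    "    return sorted(paths)\n",
    "\n",
    "\n",
    "# Shortened copy of the mapping above, used only for illustration\n",
    "excerpt = {\n",
    "    \"properties\": {\n",
    "        \"master_metadata_album_artist_name\": {\n",
    "            \"type\": \"text\",\n",
    "            \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
    "        },\n",
    "        \"ms_played\": {\"type\": \"long\"},\n",
    "        \"ts\": {\"type\": \"date\"},\n",
    "    }\n",
    "}\n",
    "print(keyword_fields(excerpt))  # ['master_metadata_album_artist_name.keyword']\n",
    "```\n",
    "\n",
    "Numeric, boolean, and date fields such as `ms_played` and `ts` have no `keyword` sub-field and are aggregated on directly."
   ]
  },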
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## We can now run queries on our data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "We get back 5653 results, here are the first ones:\n",
      "My Love Will Never Die\n",
      "Angel Of Small Death & The Codeine Scene\n",
      "Someone New - Live\n"
     ]
    }
   ],
   "source": [
    "index_name = \"spotify-history\"\n",
    "query = {\"match\": {\"master_metadata_album_artist_name\": \"Hozier\"}}\n",
    "\n",
    "# Run a simple match query, here looking for tracks by Hozier\n",
    "response = client.search(index=index_name, query=query, size=3)\n",
    "\n",
    "print(\n",
    "    \"We get back {total} results, here are the first ones:\".format(\n",
    "        total=response[\"hits\"][\"total\"][\"value\"]\n",
    "    )\n",
    ")\n",
    "for hit in response[\"hits\"][\"hits\"]:\n",
    "    print(hit[\"_source\"][\"master_metadata_track_name\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### My top artists of all time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'key': 'Hozier', 'doc_count': 5653}\n",
      "{'key': 'Ariana Grande', 'doc_count': 1543}\n",
      "{'key': 'Billie Eilish', 'doc_count': 1226}\n",
      "{'key': 'Halsey', 'doc_count': 1076}\n",
      "{'key': 'Taylor Swift', 'doc_count': 650}\n",
      "{'key': 'Cardi B', 'doc_count': 547}\n",
      "{'key': 'Beyoncé', 'doc_count': 525}\n",
      "{'key': 'Avril Lavigne', 'doc_count': 469}\n",
      "{'key': 'BLACKPINK', 'doc_count': 413}\n",
      "{'key': 'Paramore', 'doc_count': 397}\n"
     ]
    }
   ],
   "source": [
    "aggs = {\"mydata_agg\": {\"terms\": {\"field\": \"master_metadata_album_artist_name.keyword\"}}}\n",
    "\n",
    "response = client.search(index=index_name, aggregations=aggs)\n",
    "for hit in response[\"aggregations\"][\"mydata_agg\"][\"buckets\"]:\n",
    "    print(hit)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Top artists of 2024 by number of plays"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Linkin Park played 271 times.\n",
      "Hozier played 268 times.\n",
      "Dua Lipa played 112 times.\n",
      "Taylor Swift played 106 times.\n",
      "Måneskin played 61 times.\n",
      "Avril Lavigne played 55 times.\n",
      "Evanescence played 40 times.\n",
      "Paramore played 35 times.\n",
      "The Pretty Reckless played 34 times.\n",
      "Green Day played 33 times.\n"
     ]
    }
   ],
   "source": [
    "body = {\n",
    "    \"query\": {\"range\": {\"ts\": {\"gte\": \"2024\", \"lte\": \"2025\"}}},\n",
    "    \"aggs\": {\n",
    "        \"top_artists\": {\"terms\": {\"field\": \"master_metadata_album_artist_name.keyword\"}}\n",
    "    },\n",
    "}\n",
    "\n",
    "response = client.search(index=index_name, body=body)\n",
    "for hit in response[\"aggregations\"][\"top_artists\"][\"buckets\"]:\n",
    "    print(\n",
    "        \"{artist} played {times} times.\".format(\n",
    "            artist=hit[\"key\"], times=hit[\"doc_count\"]\n",
    "        )\n",
    "    )"
   ]
  },
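  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The yearly filter above can also be written as a small helper so that other years are easy to compare. The sketch below is an assumption-based variant rather than the query used in this notebook: `year_range_query` is a hypothetical name, and it uses a half-open interval (`gte` the first instant of the year, `lt` the first instant of the next one) with full timestamps.\n",
    "\n",
    "```python\n",
    "def year_range_query(year, field=\"ts\"):\n",
    "    \"\"\"Build a range query covering one calendar year (UTC).\"\"\"\n",
    "    return {\n",
    "        \"range\": {\n",
    "            field: {\n",
    "                \"gte\": f\"{year}-01-01T00:00:00Z\",  # inclusive start\n",
    "                \"lt\": f\"{year + 1}-01-01T00:00:00Z\",  # exclusive end\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "\n",
    "\n",
    "# Same aggregation as above, with the query produced by the helper\n",
    "body = {\n",
    "    \"size\": 0,\n",
    "    \"query\": year_range_query(2024),\n",
    "    \"aggs\": {\n",
    "        \"top_artists\": {\"terms\": {\"field\": \"master_metadata_album_artist_name.keyword\"}}\n",
    "    },\n",
    "}\n",
    "```\n",
    "\n",
    "The resulting `body` can be passed to `client.search(index=index_name, body=body)` exactly like the one above."
   ]
  },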
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Top artists of 2024 by amount of time played"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hozier played 268 times; for a total of 13.69 hours\n",
      "Linkin Park played 271 times; for a total of 12.21 hours\n",
      "Dua Lipa played 112 times; for a total of 4.43 hours\n",
      "Taylor Swift played 106 times; for a total of 4.39 hours\n",
      "Måneskin played 61 times; for a total of 2.48 hours\n",
      "Avril Lavigne played 55 times; for a total of 2.05 hours\n",
      "Evanescence played 40 times; for a total of 1.96 hours\n",
      "Adele played 27 times; for a total of 1.6 hours\n",
      "Billie Eilish played 32 times; for a total of 1.57 hours\n",
      "Green Day played 33 times; for a total of 1.5 hours\n"
     ]
    }
   ],
   "source": [
    "body = {\n",
    "    \"size\": 0,\n",
    "    \"query\": {\"range\": {\"ts\": {\"gte\": \"2024\", \"lte\": \"2025\"}}},\n",
    "    \"aggs\": {\n",
    "        \"top_artists\": {\n",
    "            \"terms\": {\n",
    "                \"field\": \"master_metadata_album_artist_name.keyword\",\n",
    "                \"order\": {\"total_ms_played\": \"desc\"},\n",
    "            },\n",
    "            # ms_played is in milliseconds, so the sum is too\n",
    "            \"aggs\": {\"total_ms_played\": {\"sum\": {\"field\": \"ms_played\"}}},\n",
    "        }\n",
    "    },\n",
    "}\n",
    "\n",
    "response = client.search(index=index_name, body=body)\n",
    "for hit in response[\"aggregations\"][\"top_artists\"][\"buckets\"]:\n",
    "    print(\n",
    "        \"{artist} played {times} times; for a total of {hours} hours\".format(\n",
    "            artist=hit[\"key\"],\n",
    "            times=hit[\"doc_count\"],\n",
    "            hours=round(hit[\"total_ms_played\"][\"value\"] / 3600000, 2),  # ms to hours\n",
    "        )\n",
    "    )"
   ]
  },
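  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The aggregation above sums `ms_played` per artist and divides by 3,600,000 to get hours. To make that arithmetic concrete, here is a minimal plain-Python sketch of the same idea that runs without a cluster. `top_artists_by_hours` is a hypothetical helper name, and the three records reuse values from the sample rows displayed earlier.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "MS_PER_HOUR = 3_600_000\n",
    "\n",
    "\n",
    "def top_artists_by_hours(records, limit=10):\n",
    "    \"\"\"Sum ms_played per artist and return (artist, hours) pairs, longest first.\"\"\"\n",
    "    totals = defaultdict(int)\n",
    "    for record in records:\n",
    "        artist = record.get(\"master_metadata_album_artist_name\")\n",
    "        if artist is None:  # skip rows without an artist name\n",
    "            continue\n",
    "        totals[artist] += record.get(\"ms_played\", 0)\n",
    "    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)\n",
    "    return [(artist, round(ms / MS_PER_HOUR, 2)) for artist, ms in ranked[:limit]]\n",
    "\n",
    "\n",
    "# Values taken from the sample rows shown at the top of the notebook\n",
    "sample = [\n",
    "    {\"master_metadata_album_artist_name\": \"Avril Lavigne\", \"ms_played\": 40375},\n",
    "    {\"master_metadata_album_artist_name\": \"Hozier\", \"ms_played\": 206546},\n",
    "    {\"master_metadata_album_artist_name\": \"Hozier\", \"ms_played\": 331274},\n",
    "]\n",
    "print(top_artists_by_hours(sample))  # [('Hozier', 0.15), ('Avril Lavigne', 0.01)]\n",
    "```\n",
    "\n",
    "On a full export, the Elasticsearch aggregation remains the better fit, since it ranks artists without loading every record into memory."
   ]
  },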
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## From here, you can read the [blog](/Spotify%20Wrapped%20Iulia's%20Version.md) to learn how to build the visualizations"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
